This course is an introduction to R and GitHub.
Here is the link to my IODS page
link
setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
learning2014 <- read.csv("learning2014.csv", header=TRUE)
str(learning2014)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Deep : int 43 35 42 42 44 57 46 39 52 48 ...
## $ Stra : int 27 22 29 25 29 29 18 32 34 28 ...
## $ Surf : int 31 38 27 27 34 29 23 34 26 36 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 8
The “learning2014” dataset consists of 166 observations and 8 variables. The variables (gender, Age, Attitude, Deep, Stra, Surf and Points) describe the results of a survey study conducted as an international collaboration project with teachers.
summary(learning2014)
## X gender Age Attitude Deep
## Min. : 1.00 F:110 Min. :17.00 Min. :14.00 Min. :19.00
## 1st Qu.: 42.25 M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:40.00
## Median : 83.50 Median :22.00 Median :32.00 Median :44.00
## Mean : 83.50 Mean :25.51 Mean :31.43 Mean :44.16
## 3rd Qu.:124.75 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:49.00
## Max. :166.00 Max. :55.00 Max. :50.00 Max. :59.00
## Stra Surf Points
## Min. :10.00 Min. :19.00 Min. : 7.00
## 1st Qu.:21.00 1st Qu.:29.00 1st Qu.:19.00
## Median :25.50 Median :34.00 Median :23.00
## Mean :24.97 Mean :33.45 Mean :22.72
## 3rd Qu.:29.00 3rd Qu.:38.00 3rd Qu.:27.75
## Max. :40.00 Max. :52.00 Max. :33.00
There are 110 female and 56 male students. The median age is 22 years (range 17-55). Students scored a median of 44 points (range 19-59) on the deep approach (Deep), 25.5 (10-40) on the strategic approach (Stra) and 34 (19-52) on the surface approach (Surf). The median exam score (Points) was 23 (range 7-33).
pairs(learning2014[-1], col=learning2014$gender)
The data are visualised by gender (females as black dots and males as red dots, following R's default palette for the factor levels) in a matrix of bivariate scatter plots.
The data can also be visualised using the “GGally” and “ggplot2” packages.
library("GGally")
library("ggplot2")
p <- ggpairs(learning2014[-1], mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
Now we can see that male students score higher on the attitude questions and female students on the strategic questions. We can also see that the variables Points and Attitude have the strongest correlation (cor: 0.43), and Surf and Deep the second strongest in absolute value (cor: 0.32).
The exam score a student achieves (Points) is modeled with a multiple linear regression using Attitude, Stra and Surf as independent variables. The model describes the linear relationship between the independent variables and the dependent variable.
points_regression <- lm(Points ~ Attitude + Stra + Surf, data = learning2014)
summary(points_regression)
##
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## Stra 0.10664 0.06770 1.575 0.11716
## Surf -0.04884 0.06678 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
In the first model with all three variables, Attitude explains Points with a coefficient estimate of 0.34 (p<0.0001). The other variables, Stra and Surf, do not explain Points with statistical significance. Thus, Surf (the least significant) is removed first and the model is fitted again.
points_regression <- lm(Points ~ Attitude + Stra, data = learning2014)
summary(points_regression)
##
## Call:
## lm(formula = Points ~ Attitude + Stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## Attitude 0.34658 0.05652 6.132 6.31e-09 ***
## Stra 0.11421 0.06681 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Attitude is still the only statistically significant variable, so Stra is removed as well and the model is fitted with Attitude alone (coefficient 0.35, p<0.0001).
points_regression <- lm(Points ~ Attitude, data = learning2014)
summary(points_regression)
##
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The adjusted R-squared is 0.19, meaning that in this model Attitude explains about 19% of the total variance of the dependent variable Points.
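As a quick sanity check of what R-squared measures, here is a minimal sketch on the built-in mtcars data (not the course data): for a linear model with an intercept, R-squared equals the squared correlation between the observed and fitted values.

```r
# Sketch on built-in data: R-squared equals the squared correlation
# between observed and fitted values (model with an intercept).
fit <- lm(mpg ~ wt, data = mtcars)
r2_reported <- summary(fit)$r.squared
r2_manual   <- cor(mtcars$mpg, fitted(fit))^2
all.equal(r2_reported, r2_manual)  # TRUE
```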
plot(points_regression, which = c(1,2,5))
The residuals of a linear regression model describe the validity of the model. Residuals should be normally distributed, they should not correlate with each other, they should have constant variance, and their size should not depend on the fitted values.
The Q-Q plot describes the distribution of the residuals. In our model the residuals are approximately normally distributed, as the points fall on a straight increasing line.
The Residuals vs Fitted plot describes how the residuals depend on the fitted values. According to the plot, the residuals are reasonably evenly distributed on both sides of the zero line and thus do not depend on the size of the fitted values.
The Residuals vs Leverage plot helps identify observations with an unusually high impact on the model. According to the plot, all leverage values fall between 0 and 0.05, so no single observation has an unusual impact on the model.
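The diagnostic plots can be complemented with numeric checks; a minimal sketch on the built-in mtcars data (not the course data):

```r
# Sketch on built-in data: numeric checks to complement the diagnostic plots.
fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)

# Formal normality test of the residuals (complements the Q-Q plot).
shapiro.test(res)

# Points whose leverage is more than twice the average leverage
# (complements the Residuals vs Leverage plot).
which(hatvalues(fit) > 2 * mean(hatvalues(fit)))
```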
setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
alc <- read.csv("alc.csv", header = TRUE)
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use" "edu"
dim(alc)
## [1] 382 37
The data consist of 382 observations and 37 variables. Their names are listed above; the data relate alcohol use to the social, gender and study-time attributes of each student.
Data Set Information:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
Attribute Information:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 nursery - attended nursery school (binary: yes or no)
13 internet - Internet access at home (binary: yes or no)
14 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
15 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
16 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
17 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
18 schoolsup - extra educational support (binary: yes or no)
19 famsup - family educational support (binary: yes or no)
20 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
21 activities - extra-curricular activities (binary: yes or no)
22 higher - wants to take higher education (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
34 alc_use - average of ‘Dalc’ and ‘Walc’
35 high_use - TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
Hypothesis: a high number of ‘absences’, no ‘famsup’, low ‘studytime’ and bad ‘famrel’ correlate with higher alcohol consumption.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
g1 <- ggplot(alc, aes(x = alc_use, y = absences, fill = sex))
g1 + geom_bar(stat="identity") + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")
g2 <- ggplot(alc, aes(x = alc_use, y = (famsup == "yes"), fill=sex))
g2 + geom_bar(stat="identity") + ylab("family educational support") + ggtitle("Student family educational support by alcohol consumption")
g3 <- ggplot(data = alc, aes(x = alc_use, y = studytime, fill=sex))
g3 + geom_bar(stat="identity") + ylab("weekly study time") + ggtitle("Student weekly study time by alcohol consumption and sex")
g4 <- ggplot(alc, aes(x = alc_use, y = famrel, fill = sex))
g4 + geom_bar(stat="identity") + ylab("quality of family relationships") + ggtitle("Student quality of family relationships by alcohol consumption and sex")
alc %>% group_by(famsup) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 2 × 3
## famsup count mean_alc_use
## <fctr> <int> <dbl>
## 1 no 144 1.965278
## 2 yes 238 1.842437
alc %>% group_by(alc_use) %>% summarise(count = n(), mean_absences = mean(absences))
## # A tibble: 9 × 3
## alc_use count mean_absences
## <dbl> <int> <dbl>
## 1 1.0 140 3.357143
## 2 1.5 69 4.231884
## 3 2.0 59 3.915254
## 4 2.5 44 6.431818
## 5 3.0 32 6.093750
## 6 3.5 17 5.647059
## 7 4.0 9 6.000000
## 8 4.5 3 12.000000
## 9 5.0 9 6.888889
alc %>% group_by(alc_use) %>% summarise(count = n(), mean_studytime = mean(studytime))
## # A tibble: 9 × 3
## alc_use count mean_studytime
## <dbl> <int> <dbl>
## 1 1.0 140 2.307143
## 2 1.5 69 1.971014
## 3 2.0 59 1.983051
## 4 2.5 44 1.931818
## 5 3.0 32 1.718750
## 6 3.5 17 1.470588
## 7 4.0 9 1.777778
## 8 4.5 3 2.000000
## 9 5.0 9 1.666667
alc %>% group_by(famrel) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 5 × 3
## famrel count mean_alc_use
## <int> <int> <dbl>
## 1 1 8 2.250000
## 2 2 19 2.184211
## 3 3 64 2.000000
## 4 4 189 1.880952
## 5 5 102 1.750000
Lack of family educational support and bad quality of family relationships seem to correlate with higher alcohol consumption. Additionally, high alcohol consumption correlates with a high number of absences from school. In addition, students with high alcohol consumption (alc_use > 3.0) study less than students with low alcohol consumption (alc_use < 2.5). These findings are concordant with the primary hypotheses.
alc$famrel <- factor(alc$famrel)
alc$studytime <- factor(alc$studytime)
m <- glm(high_use ~ absences + famsup + alc$studytime + alc$famrel, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + famsup + alc$studytime +
## alc$famrel, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1803 -0.8226 -0.6503 1.1522 2.2409
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.015455 0.878444 -1.156 0.247693
## absences 0.076934 0.022630 3.400 0.000675 ***
## famsupyes -0.061188 0.245702 -0.249 0.803337
## alc$studytime2 -0.439698 0.267726 -1.642 0.100520
## alc$studytime3 -1.342617 0.442054 -3.037 0.002388 **
## alc$studytime4 -1.279644 0.588173 -2.176 0.029583 *
## alc$famrel2 0.936030 0.964897 0.970 0.332005
## alc$famrel3 0.643055 0.883261 0.728 0.466585
## alc$famrel4 0.272456 0.858654 0.317 0.751012
## alc$famrel5 -0.006842 0.876748 -0.008 0.993773
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 427.24 on 372 degrees of freedom
## AIC: 447.24
##
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.3622376 0.04877998 1.8120704
## absences 1.0799706 1.03533984 1.1316272
## famsupyes 0.9406468 0.58232256 1.5284716
## alc$studytime2 0.6442312 0.38110876 1.0908572
## alc$studytime3 0.2611613 0.10353967 0.5966021
## alc$studytime4 0.2781364 0.07612213 0.8066302
## alc$famrel2 2.5498396 0.42150543 21.4386194
## alc$famrel3 1.9022832 0.37564722 14.2026417
## alc$famrel4 1.3131852 0.27385709 9.4789868
## alc$famrel5 0.9931812 0.19882276 7.3471751
According to the logistic regression model, a higher number of absences correlates with high alcohol use (OR > 1), whereas studying more than 5 hours a week correlates with lower alcohol use. Lack of family support and the quality of family relationships do not correlate significantly with alcohol consumption in this model.
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 253 15
## TRUE 94 20
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
## prediction
## high_use FALSE TRUE
## FALSE 0.66230366 0.03926702
## TRUE 0.24607330 0.05235602
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66230366 0.03926702 0.70157068
## TRUE 0.24607330 0.05235602 0.29842932
## Sum 0.90837696 0.09162304 1.00000000
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2853403
According to the results, about 29% of the predictions were wrong. This is clearly better than simple guessing (50% of predictions wrong), although not optimal.
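To see what the loss function computes, here is a toy check with hypothetical labels and probabilities: a prediction counts as wrong when its probability falls on the wrong side of 0.5.

```r
# Toy check of the loss function with hypothetical values:
# exactly one of the four predictions is on the wrong side of 0.5.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = c(TRUE, TRUE, FALSE, FALSE),
          prob  = c(0.9, 0.2, 0.1, 0.4))  # 0.25
```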
library(boot)
# Note: cv.glm() re-fits the model on each training fold, so the model formula
# must use plain column names (studytime, famrel). Because m was defined with
# alc$studytime and alc$famrel, the call below errors and is left commented out.
#cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
#cv
#cv$delta[1]
In 10-fold cross-validation, the test-set performance is worse: about 31% of the predictions are wrong.
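For reference, a self-contained sketch of 10-fold cross-validation with cv.glm() on the built-in mtcars data (am, mpg and wt are stand-ins, not the course variables); note that the formula uses plain column names so that cv.glm() can re-fit the model on each fold.

```r
# Sketch of 10-fold cross-validation with cv.glm() on built-in data.
library(boot)

# Same loss function as above: proportion of predictions on the wrong side of 0.5.
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# Logistic model with plain column names in the formula, so that
# cv.glm() can re-fit it on each training fold.
m2 <- glm(am ~ mpg + wt, data = mtcars, family = "binomial")

set.seed(1)  # for a reproducible fold split
cv <- cv.glm(data = mtcars, cost = loss_func, glmfit = m2, K = 10)
cv$delta[1]  # average proportion of wrong predictions across folds
```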
#setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston data frame has 506 rows and 14 columns. The data frame contains the following variables: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat and medv.
pairs(Boston)
library(dplyr)
cor_matrix<-cor(Boston)
cor_matrix
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593
## ptratio black lstat medv
## crim 0.2899456 -0.38506394 0.4556215 -0.3883046
## zn -0.3916785 0.17552032 -0.4129946 0.3604453
## indus 0.3832476 -0.35697654 0.6037997 -0.4837252
## chas -0.1215152 0.04878848 -0.0539293 0.1752602
## nox 0.1889327 -0.38005064 0.5908789 -0.4273208
## rm -0.3555015 0.12806864 -0.6138083 0.6953599
## age 0.2615150 -0.27353398 0.6023385 -0.3769546
## dis -0.2324705 0.29151167 -0.4969958 0.2499287
## rad 0.4647412 -0.44441282 0.4886763 -0.3816262
## tax 0.4608530 -0.44180801 0.5439934 -0.4685359
## ptratio 1.0000000 -0.17738330 0.3740443 -0.5077867
## black -0.1773833 1.00000000 -0.3660869 0.3334608
## lstat 0.3740443 -0.36608690 1.0000000 -0.7376627
## medv -0.5077867 0.33346082 -0.7376627 1.0000000
cor_matrix %>% round(2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
#install.packages("corrplot")
library(corrplot)
corrplot(cor_matrix %>% round(2), method="circle")
corrplot(cor_matrix %>% round(2), method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
There seems to be a strong negative correlation between age (proportion of owner-occupied units built prior to 1940) and dis (weighted mean of distances to five Boston employment centres), between nox (nitrogen oxides concentration) and dis, between indus (proportion of non-retail business acres per town) and dis, and between lstat (lower status of the population) and medv (median value of owner-occupied homes in $1000s). On the other hand, there is a strong positive correlation between indus and nox, indus and tax (full-value property-tax rate per $10,000), nox and age, nox and tax, rm (average number of rooms per dwelling) and medv, and rad (index of accessibility to radial highways) and tax.
Judging by the differences between the medians and means, and by their values relative to the minima and maxima, the variables crim, zn, age, rad, tax, indus and black appear to be skewed (non-normally distributed), while the others may be approximately normal. However, to draw such conclusions one should first run a formal normality test (for instance Shapiro-Wilk or Kolmogorov-Smirnov).
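As a sketch, such a test can be run directly on the Boston columns (Shapiro-Wilk here, which is convenient for samples of this size):

```r
# Sketch: formal normality tests on two Boston variables.
library(MASS)
data("Boston")

# crim is heavily right-skewed; the test should reject normality decisively.
shapiro.test(Boston$crim)

# rm looks roughly symmetric in the summary, but can still fail the test.
shapiro.test(Boston$rm)
```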
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Scaling subtracts the column mean from each value and divides by the column standard deviation. Thus, after scaling, all variables have a mean of 0 and a standard deviation of 1, which makes the distributions easier to compare. The closer the median is to 0, the more symmetric the distribution. Note that scaling is a linear transformation: unlike a logarithmic transformation, it does not change the shape of the distributions.
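A minimal sketch of what scale() does, on a hypothetical vector:

```r
# Sketch: scale() standardizes to mean 0 and sd 1, i.e. (x - mean(x)) / sd(x).
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE
c(mean(manual), sd(manual))             # 0 1
```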
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
#remove original crim from the dataset
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
library(MASS)   # the lda() function used below comes from MASS, not the 'lda' package
lda.fit <- lda(crime~., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2698020 0.2450495 0.2277228 0.2574257
##
## Group means:
## zn indus chas nox rm
## low 0.95681457 -0.9352889 -0.127848325 -0.8827690 0.4386072
## med_low -0.08315566 -0.2430318 0.006051757 -0.5048785 -0.1512355
## med_high -0.38377579 0.1394693 0.198411178 0.3043928 0.1748530
## high -0.48724019 1.0170690 -0.007331936 1.0781191 -0.3926115
## age dis rad tax ptratio black
## low -0.8927844 0.8879153 -0.6952813 -0.7317611 -0.4820477 0.3786823
## med_low -0.2429419 0.2552198 -0.5572864 -0.4423375 -0.0816398 0.3490755
## med_high 0.4277372 -0.3446267 -0.4251144 -0.3287766 -0.2355170 0.1811425
## high 0.7991204 -0.8367862 1.6386213 1.5144083 0.7813507 -0.7125453
## lstat medv
## low -0.78074623 0.51135373
## med_low -0.08460568 -0.01894297
## med_high -0.07367346 0.24107520
## high 0.91891561 -0.72411156
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.116441085 0.69779190 -0.75458733
## indus 0.038513302 -0.31620459 0.39101804
## chas -0.062071500 -0.05880809 0.11829047
## nox 0.437756382 -0.68614657 -1.54840350
## rm -0.149262567 -0.06059328 -0.15826392
## age 0.224347127 -0.46597807 -0.18525818
## dis 0.001093547 -0.34425723 -0.02177413
## rad 3.370162180 0.86904316 -0.06551355
## tax 0.003589327 0.09553562 0.55778282
## ptratio 0.102061530 0.02177559 -0.26053904
## black -0.139812591 0.04299155 0.22243764
## lstat 0.196710187 -0.17094701 0.45603266
## medv 0.184517380 -0.40604132 -0.25638771
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9560 0.0335 0.0104
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 9 7 2 0
## med_low 6 19 2 0
## med_high 1 11 20 2
## high 0 0 0 23
According to the results, the model predicts the true values of the categorical crime variable well. Most of the errors occur in the med_low group, where the prediction accuracy is the lowest.
# Note: boston_scaled still contains the factor variable 'crime', which dist()
# coerces to NA -- hence the warnings below.
dist_eu <- dist(boston_scaled)
## Warning in dist(boston_scaled): NAs introduced by coercion
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1394 3.5270 4.9080 4.9390 6.2420 13.0000
dist_man <- dist(boston_scaled, method = "manhattan")
## Warning in dist(boston_scaled, method = "manhattan"): NAs introduced by
## coercion
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')
km <-kmeans(dist_eu, centers = 2)
pairs(Boston, col = km$cluster)
km <-kmeans(dist_eu, centers = 3)
lda.fit <- lda(km$cluster~., data = boston_scaled)
lda.fit
## Call:
## lda(km$cluster ~ ., data = boston_scaled)
##
## Prior probabilities of groups:
## 1 2 3
## 0.4426877 0.2075099 0.3498024
##
## Group means:
## zn indus chas nox rm age
## 1 -0.2309344 -0.4551440 -0.27232907 -0.4915087 -0.1311176 -0.2883074
## 2 1.3140079 -0.9068835 0.47759479 -0.7455383 1.0548777 -0.7434166
## 3 -0.4872402 1.1139831 0.06132349 1.0642908 -0.4598408 0.8058735
## dis rad tax ptratio black lstat
## 1 0.2864417 -0.5768315 -0.6147205 -0.006886361 0.3154274 -0.2702405
## 2 0.7952165 -0.5935800 -0.6372993 -0.976736455 0.3398644 -0.8793299
## 3 -0.8342410 1.0821251 1.1560103 0.588134874 -0.6007995 0.8636357
## medv crimemed_low crimemed_high crimehigh
## 1 -0.02274038 0.43750000 0.2589286 0.0000000
## 2 1.16160305 0.16190476 0.2761905 0.0000000
## 3 -0.66030777 0.06214689 0.2203390 0.7175141
##
## Coefficients of linear discriminants:
## LD1 LD2
## zn 0.172610013 -1.25618179
## indus 1.078970813 -0.12226466
## chas 0.004463932 -0.49433521
## nox 0.921052125 -0.16586799
## rm -0.040950198 -0.49922411
## age -0.067015888 0.07618846
## dis 0.138379223 -0.06104503
## rad 0.579745765 0.43033468
## tax 0.401343141 -0.66310399
## ptratio 0.374368029 0.11727097
## black -0.061203934 0.05315043
## lstat 0.376322744 -0.56677965
## medv 0.109367758 -0.69236403
## crimemed_low -0.389791316 0.15910228
## crimemed_high -0.486827228 -0.58792893
## crimehigh -0.207144258 -1.08365331
##
## Proportion of trace:
## LD1 LD2
## 0.7951 0.2049
plot(lda.fit, dimen = 2)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
Using 3 clusters in the K-means analysis, the most influential linear separators (largest absolute coefficients on LD1, which carries 79.5% of the trace) are indus and nox. This means that if the data were divided into 3 classes, these variables would explain most of the between-class difference and would thus be the best variables for predicting possible new observations.
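A sketch of how such influence can be ranked directly from the coefficient matrix, using the built-in iris data instead of the Boston clusters:

```r
# Sketch on built-in data: rank variables by their absolute weight
# on the first linear discriminant.
library(MASS)
fit <- lda(Species ~ ., data = iris)
sort(abs(fit$scaling[, "LD1"]), decreasing = TRUE)
```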
model_predictors <- dplyr::select(train, -crime)
###Check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 16 2
###Matrix multiplication
#matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
#matrix_product <- as.data.frame(matrix_product)
#install.packages("plotly")
#library("plotly")
###Create a 3D plot (Cool!) of the columns of the matrix product
#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)
#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=km$center)
The same results as above are visualised in 3D, using the variable crime in the first plot and the selected cluster assignments in the second to colour the observations. For some reason, the visualisation does not work after knitting the file.
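A likely reason the matrix multiplication above is commented out is a dimension mismatch: model_predictors has 13 columns while lda.fit$scaling has 16 rows (the dummy-coded crime categories are part of the LDA scaling but not of the predictors). The projection itself is plain matrix multiplication; a self-contained toy sketch with made-up numbers:

```r
set.seed(1)
X <- matrix(rnorm(5 * 4), nrow = 5)             # 5 observations, 4 predictors
scaling <- matrix(rnorm(4 * 2), nrow = 4)       # 4 x 2, like lda.fit$scaling
matrix_product <- as.data.frame(X %*% scaling)  # rows projected onto LD1, LD2
dim(matrix_product)
```

The multiplication only works when the column count of X equals the row count of the scaling matrix, which is exactly the condition that fails above (13 vs 16).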
setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
human2 <- read.csv("human2.csv")
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
colnames(human2)[1] <- "Country"
human2 <- human2[ ,2:9]
#human2 <- select(human2, -Country)
dim(human2)
## [1] 155 8
str(human2)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : Factor w/ 155 levels "1,123","1,228",..: 132 109 124 112 113 111 103 123 108 93 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
There are 8 variables and 155 observations.
The original data is from http://hdr.undp.org/en/content/human-development-index-hdi and has been retrieved, modified and analyzed by Tuomo Nieminen. The data combines several indicators from most countries in the world.
“Country” = Country name
Health and knowledge
“GNI” = Gross National Income per capita
“Life.Exp” = Life expectancy at birth
“Edu.Exp” = Expected years of schooling
“Mat.Mor” = Maternal mortality ratio
“Ado.Birth” = Adolescent birth rate
Empowerment
“Parli.F” = Percentage of female representatives in parliament
“Edu2.F” = Proportion of females with at least secondary education
“Edu2.M” = Proportion of males with at least secondary education
“Labo.F” = Proportion of females in the labour force
“Labo.M” = Proportion of males in the labour force
“Edu2.FM” = Edu2.F / Edu2.M
“Labo.FM” = Labo.F / Labo.M
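The two ratio variables at the end of the list can be derived with a single mutate. A toy sketch with hypothetical proportions (only the column names follow the list above):

```r
library(dplyr)

# Hypothetical proportions standing in for the original data
human_raw <- data.frame(Edu2.F = c(0.80, 0.60), Edu2.M = c(0.85, 0.75),
                        Labo.F = c(0.50, 0.40), Labo.M = c(0.70, 0.80))
human_ratios <- mutate(human_raw,
                       Edu2.FM = Edu2.F / Edu2.M,   # female/male education ratio
                       Labo.FM = Labo.F / Labo.M)   # female/male labour force ratio
human_ratios$Edu2.FM
```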
human2 <- mutate(human2, GNI = str_replace(human2$GNI, pattern = ",", replacement = "") %>% as.numeric())
pairs(human2)
summary(human2)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
library(GGally)
library(corrplot)
#ggpairs(data=human2)
cor(human2) %>% corrplot()
Variables “Edu2.FM”, “Labo.FM”, “Life.Exp”, “Edu.Exp” and “Parli.F” appear roughly normally distributed, as their mean and median values are close to each other. On the other hand, “GNI”, “Mat.Mor” and “Ado.Birth” have clearly differing median and mean values, indicating that their distributions are skewed.
Variables “Life.Exp” and “Edu.Exp” correlate positively, as do “Mat.Mor” and “Ado.Birth”. Negative correlations are seen between “Life.Exp” and “Mat.Mor”, “Edu.Exp” and “Mat.Mor”, “Life.Exp” and “Ado.Birth”, and “Edu.Exp” and “Ado.Birth”.
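The strongest pairs can also be read off the correlation matrix programmatically. A small sketch that picks out the largest absolute pairwise correlation, using mtcars as stand-in numeric data since it ships with R:

```r
m <- cor(mtcars[, c("mpg", "wt", "hp", "qsec")])     # pairwise correlations
m[upper.tri(m, diag = TRUE)] <- NA                   # keep each pair only once
idx <- which(abs(m) == max(abs(m), na.rm = TRUE), arr.ind = TRUE)
c(rownames(m)[idx[1]], colnames(m)[idx[2]], round(m[idx], 2))
```

The same two lines applied to cor(human2) would name the strongest pair discussed above.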
pca_human <- prcomp(human2)
biplot(pca_human, choices = 1:2, cex=c(0.8,1), col=c("grey40", "deeppink2"))
human_scaled <- scale(human2)
pca_human <- prcomp(human_scaled)
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
#Rounded percentages of variance captured by each PC
pca_pr <- round(1*s$importance[2, ], digits = 5)
#Print out the percentages of variance
round(pca_pr*100, digits=1)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
pca_pr <- round(pca_pr*100, digits=1)
#Create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
paste0(names(pca_pr), " (", pca_pr, "%)")
## [1] "PC1 (53.6%)" "PC2 (16.2%)" "PC3 (9.6%)" "PC4 (7.6%)" "PC5 (5.5%)"
## [6] "PC6 (3.6%)" "PC7 (2.6%)" "PC8 (1.3%)"
biplot(pca_human, choices = 1:2, cex=c(0.8,1), col=c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
The results differ. The PCA plot of the unscaled data suggests that a single variable explains nearly all of the variance of the two principal components and is aligned with PC1. After scaling, all variables are clearly displayed in the plot, each pointing in a different direction with roughly equal amplitude. Scaling reduces the differences between the value ranges of the variables and thus makes them comparable.
Adolescent birth rate and maternal mortality point towards increasing PC1, whereas life expectancy, expected years of schooling and the proportion of females to males with at least secondary education point in the opposite direction. This is understandable, as the opposing arrows describe phenomena of developing and developed countries, respectively. Perpendicular to these variables are the percentage of female representatives in parliament and the proportion of females to males in the labour force, which point towards increasing PC2.
PC1 describes a dimension of health and knowledge. Variables aligned with PC1 relate either to wellbeing or to education, which correlate highly with each other, as described earlier in the correlation matrix. PC2 describes female empowerment and is independent of the themes of PC1. The correlation plot supports this notion, as variables aligned with PC2 correlate positively with each other but negatively with variables aligned with PC1.
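The effect of scaling can be demonstrated on a toy data frame where one column has a much larger variance, as GNI does above. This is a sketch on simulated data, not the course data:

```r
set.seed(42)
d <- data.frame(small = rnorm(100, sd = 1),     # variance ~1
                big   = rnorm(100, sd = 1000))  # variance ~1e6
p_raw    <- prcomp(d)          # unscaled: "big" dominates PC1
p_scaled <- prcomp(scale(d))   # scaled: both variables contribute
summary(p_raw)$importance[2, 1]     # proportion of variance on PC1, near 1
summary(p_scaled)$importance[2, 1]  # roughly 0.5 after scaling
```

This mirrors the two biplots above: without scaling, the high-variance variable alone defines PC1.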
library(FactoMineR)
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
tea_time <- data.frame(tea$Tea, tea$How, tea$how, tea$sugar, tea$where, tea$lunch)
summary(tea_time)
## tea.Tea tea.How tea.how tea.sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## tea.where tea.lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ tea.Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ tea.How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ tea.how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ tea.sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ tea.where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ tea.lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped
###Run MCA
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## tea.Tea | 0.126 0.108 0.410 |
## tea.How | 0.076 0.190 0.394 |
## tea.how | 0.708 0.522 0.010 |
## tea.sugar | 0.065 0.001 0.336 |
## tea.where | 0.702 0.681 0.055 |
## tea.lunch | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage="quali")
The variable categories are distributed on a biplot of two principal dimensions explaining 15.2% and 14.2% of the total variance, respectively.
In this assignment, I will analyse how alcohol consumption and other phenomena affect school grades. For the analysis, I will use a linear regression model and an MCA plot. According to the linear regression model, students who study more and students without romantic relationships are more likely to have higher school grades. The MCA plot nicely validates the role of romantic relationships.
Using the youth alcohol consumption data downloaded from the UCI Machine Learning Repository, I am going to study the variance of different variables using MCA and the effect of different variables on grades using linear regression.
My hypothesis is that a high number of ‘absences’, no ‘famsup’, low ‘studytime’ and bad ‘famrel’ correlate with higher alcohol consumption.
The data wrangling script is found [here](https://github.com/obruck/IODS-project/blob/master/data/alc_data%20wrangling.R)
The data files “mat” and “por” have been joined using common variables as identifiers. Then, duplicate answers have been omitted. Three new variables have been created: “alc_use” is the average of the workday and weekend alcohol consumption, “edu” is the combined education level of the mother and the father, and “high_use” flags students with high alcohol consumption.
setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
alc <- read.csv("alc.csv")
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use" "edu"
dim(alc)
## [1] 382 37
str(alc)
## 'data.frame': 382 obs. of 37 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
## $ edu : int 8 2 2 6 6 7 4 8 5 7 ...
The data consists of 382 observations and 37 variables, describing students’ alcohol use together with social, demographic and study-related attributes.
Data Set Information:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
Attribute Information:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education) 8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education) 9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 nursery - attended nursery school (binary: yes or no)
13 internet - Internet access at home (binary: yes or no)
14 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
15 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
16 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
17 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
18 schoolsup - extra educational support (binary: yes or no)
19 famsup - family educational support (binary: yes or no)
20 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
21 activities - extra-curricular activities (binary: yes or no)
22 higher - wants to take higher education (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
34 alc_use - average of ‘Dalc’ and ‘Walc’
35 high_use - TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
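The derived variables 34 and 35 follow directly from their definitions. A toy sketch of the same rule, with made-up rows but the column names of the data:

```r
library(dplyr)

toy <- data.frame(Dalc = c(1, 4, 2), Walc = c(1, 5, 3))
toy <- mutate(toy,
              alc_use  = (Dalc + Walc) / 2,   # average of workday and weekend use
              high_use = alc_use > 2)         # TRUE above the cut-off of 2
toy$high_use  # FALSE TRUE TRUE
```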
summary(alc)
## X school sex age address famsize
## Min. : 1.00 GP:342 F:198 Min. :15.00 R: 81 GT3:278
## 1st Qu.: 96.25 MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104
## Median :191.50 Median :17.00
## Mean :191.50 Mean :16.59
## 3rd Qu.:286.75 3rd Qu.:17.00
## Max. :382.00 Max. :22.00
## Pstatus Medu Fedu Mjob Fjob
## A: 38 Min. :0.000 Min. :0.000 at_home : 53 at_home : 16
## T:344 1st Qu.:2.000 1st Qu.:2.000 health : 33 health : 17
## Median :3.000 Median :3.000 other :138 other :211
## Mean :2.806 Mean :2.565 services: 96 services:107
## 3rd Qu.:4.000 3rd Qu.:4.000 teacher : 62 teacher : 31
## Max. :4.000 Max. :4.000
## reason nursery internet guardian traveltime
## course :140 no : 72 no : 58 father: 91 Min. :1.000
## home :110 yes:310 yes:324 mother:275 1st Qu.:1.000
## other : 34 other : 16 Median :1.000
## reputation: 98 Mean :1.448
## 3rd Qu.:2.000
## Max. :4.000
## studytime failures schoolsup famsup paid activities
## Min. :1.000 Min. :0.0000 no :331 no :144 no :205 no :181
## 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:238 yes:177 yes:201
## Median :2.000 Median :0.0000
## Mean :2.037 Mean :0.2016
## 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :3.0000
## higher romantic famrel freetime goout
## no : 18 no :261 Min. :1.000 Min. :1.00 Min. :1.000
## yes:364 yes:121 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:2.000
## Median :4.000 Median :3.00 Median :3.000
## Mean :3.937 Mean :3.22 Mean :3.113
## 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :5.000 Max. :5.00 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.0
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 1.0
## Median :1.000 Median :2.000 Median :4.000 Median : 3.0
## Mean :1.482 Mean :2.296 Mean :3.573 Mean : 4.5
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 6.0
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :45.0
## G1 G2 G3 alc_use
## Min. : 2.00 Min. : 4.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median :12.00 Median :12.00 Median :12.00 Median :1.500
## Mean :11.49 Mean :11.47 Mean :11.46 Mean :1.889
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :18.00 Max. :18.00 Max. :18.00 Max. :5.000
## high_use edu
## Mode :logical Min. :1.000
## FALSE:268 1st Qu.:4.000
## TRUE :114 Median :6.000
## NA's :0 Mean :5.372
## 3rd Qu.:7.000
## Max. :8.000
library(dplyr)
library(corrplot)
alc_num <- data.frame(alc$age, alc$Medu, alc$Fedu, alc$traveltime, alc$studytime, alc$failures, alc$famrel, alc$freetime, alc$goout, alc$Dalc, alc$Walc, alc$health, alc$absences, alc$G1, alc$G2, alc$G3, alc$alc_use)
corrplot(cor(alc_num), method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)
The data comprises 198 female and 184 male students. Normality is assessed based on a sufficient sample size (here, n = 382 is considered sufficient) and on how close the median and mean values are to each other. Variables “age”, “Medu”, “studytime”, “failures”, “famrel”, “freetime”, “goout”, “Dalc”, “Walc”, “G1”, “G2”, “G3” and “alc_use” are approximately normally distributed. In addition, the variables “school”, “sex”, “address”, “famsize”, “Pstatus”, “Mjob”, “Fjob”, “reason”, “nursery”, “internet”, “guardian”, “schoolsup”, “famsup”, “paid”, “activities”, “higher”, “romantic” and “high_use” are binomial or multinomial.
As correlation plots can only be drawn for numeric variables, only these are collected into the “alc_num” data frame. Their correlations are then assessed with a correlation plot. The education levels of the students’ mothers and fathers correlate positively. The same is seen between weekday and weekend alcohol consumption, as well as between the first-period, second-period and final school grades. On the other hand, the number of past class failures correlates negatively with school grades.
The distributions of the nominal variables are assessed next. As can be seen, the numbers of male and female students, of students with or without extracurricular activities, and of students with or without extra paid classes within the course subject are roughly similar. However, more students live in urban areas, have a family size of more than 3 members, receive family educational support, do not consume high amounts of alcohol, want to take higher education, are not in a romantic relationship, attend the Gabriel Pereira school and do not receive extra educational support.
library(dplyr)
library(ggplot2)
library(tidyr)
library(stringr)
alc_nom <- data.frame(alc$school, alc$sex, alc$address, alc$famsize, alc$paid, alc$schoolsup, alc$famsup, alc$activities, alc$higher, alc$romantic, alc$high_use)
gather(alc_nom) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped
Next, we explore the impact of the mother’s, the father’s and the combined parental education on alcohol use and the final grade. Parental education does not seem to affect alcohol use but shows a clear positive association with higher mean grades.
alc %>% group_by(Medu) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 5 × 3
## Medu count mean_alc_use
## <int> <int> <dbl>
## 1 0 3 2.166667
## 2 1 51 1.950980
## 3 2 98 1.739796
## 4 3 95 2.021053
## 5 4 135 1.874074
alc %>% group_by(Fedu) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 5 × 3
## Fedu count mean_alc_use
## <int> <int> <dbl>
## 1 0 2 1.000000
## 2 1 77 1.941558
## 3 2 105 1.895238
## 4 3 99 1.787879
## 5 4 99 1.959596
alc %>% group_by(edu) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 8 × 3
## edu count mean_alc_use
## <int> <int> <dbl>
## 1 1 2 1.000000
## 2 2 36 1.958333
## 3 3 39 1.884615
## 4 4 68 1.933824
## 5 5 41 1.768293
## 6 6 63 1.888889
## 7 7 58 1.775862
## 8 8 75 1.993333
alc %>% group_by(Medu) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 5 × 3
## Medu count mean_grade
## <int> <int> <dbl>
## 1 0 3 12.66667
## 2 1 51 10.09804
## 3 2 98 10.79592
## 4 3 95 11.49474
## 5 4 135 12.40000
alc %>% group_by(Fedu) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 5 × 3
## Fedu count mean_grade
## <int> <int> <dbl>
## 1 0 2 13.50000
## 2 1 77 10.07792
## 3 2 105 11.56190
## 4 3 99 11.72727
## 5 4 99 12.11111
alc %>% group_by(edu) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 8 × 3
## edu count mean_grade
## <int> <int> <dbl>
## 1 1 2 13.500000
## 2 2 36 9.861111
## 3 3 39 10.307692
## 4 4 68 10.867647
## 5 5 41 11.560976
## 6 6 63 12.063492
## 7 7 58 12.120690
## 8 8 75 12.226667
Linear regression is a statistical method to predict a continuous variable (the dependent variable) with one or multiple variables (independent variables). If only one explanatory variable is used, the model is called simple linear regression; otherwise, it is called multiple linear regression. Linear regression takes into account the multicollinearity of the independent variables and computes their “true” regression coefficients with respect to the dependent variable. Eventually, one can form the equation y = b1x1 + … + bnxn + C, where y is the response (dependent) variable, the x’s are covariates, the b’s are the regression coefficients of the covariates, and C is the error term.
Multiple correspondence analysis (MCA) is a visualisation method for analysing categorical data and the connections between different variables and their categories. The data are projected onto a two-dimensional plot where the x-axis and y-axis represent the components explaining most of the variance in the data. The categories of the selected variables are then distributed on the plot based on their associations with the principal dimensions and with the other variables. Unlike PCA or LDA, MCA compares the categories of categorical variables rather than the exact values of each observation.
First, let’s run a linear regression analysis where the final grade is the response variable and gender, the tendency to go out with friends, the number of absences, alcohol consumption, family educational support, study time and romantic relationship status are the independent variables.
alc$studytime <- factor(alc$studytime)
m <- lm(G3 ~ sex + goout + absences + alc_use + famsup + studytime + romantic, data = alc)
summary(m)
##
## Call:
## lm(formula = G3 ~ sex + goout + absences + alc_use + famsup +
## studytime + romantic, data = alc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5079 -1.8658 0.2236 2.1942 7.2735
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.09832 0.68053 17.778 < 2e-16 ***
## sexM 0.52857 0.36391 1.452 0.147207
## goout -0.28159 0.16077 -1.751 0.080691 .
## absences -0.01752 0.03153 -0.556 0.578758
## alc_use -0.27792 0.19608 -1.417 0.157210
## famsupyes -0.07760 0.34674 -0.224 0.823046
## studytime2 0.90464 0.41559 2.177 0.030127 *
## studytime3 1.94795 0.57101 3.411 0.000717 ***
## studytime4 1.57705 0.72135 2.186 0.029421 *
## romanticyes -0.77491 0.36009 -2.152 0.032042 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.218 on 372 degrees of freedom
## Multiple R-squared: 0.077, Adjusted R-squared: 0.05467
## F-statistic: 3.448 on 9 and 372 DF, p-value: 0.0004245
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 1.795706e+05 4.710642e+04 6.845268e+05
## sexM 1.696507e+00 8.294443e-01 3.469956e+00
## goout 7.545849e-01 5.500615e-01 1.035154e+00
## absences 9.826307e-01 9.235547e-01 1.045485e+00
## alc_use 7.573586e-01 5.150550e-01 1.113652e+00
## famsupyes 9.253378e-01 4.679389e-01 1.829833e+00
## studytime2 2.471031e+00 1.091375e+00 5.594774e+00
## studytime3 7.014305e+00 2.282198e+00 2.155838e+01
## studytime4 4.840670e+00 1.171881e+00 1.999528e+01
## romanticyes 4.607442e-01 2.269591e-01 9.353457e-01
Then, we shall visually analyse the final school grade, romantic relationships, high alcohol use, gender and extracurricular activities using an MCA plot.
####Transform the G3 variable to categorical
attach(alc)
alc$G3cat[G3 <= 10] <- "Low grade"
alc$G3cat[G3 > 10 & G3 <= 14] <- "Intermediate grade"
alc$G3cat[G3 > 14] <- "High grade"
detach(alc)
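The same categorization can be written without attach()/detach() using cut(); the breakpoints below mirror the rule above, applied to hypothetical grades:

```r
g3 <- c(8, 11, 14, 15, 10)   # hypothetical final grades
G3cat <- cut(g3,
             breaks = c(-Inf, 10, 14, Inf),   # (-Inf,10], (10,14], (14,Inf)
             labels = c("Low grade", "Intermediate grade", "High grade"))
table(G3cat)
```

cut() returns a factor directly, which is convenient for the MCA below, whereas the indexing approach above produces a character column.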
str(alc)
## 'data.frame': 382 obs. of 38 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : Factor w/ 4 levels "1","2","3","4": 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
## $ edu : int 8 2 2 6 6 7 4 8 5 7 ...
## $ G3cat : chr "Low grade" "Low grade" "Intermediate grade" "Intermediate grade" ...
library(FactoMineR)
alc_mca <- data.frame(alc$G3cat, alc$romantic, alc$high_use, alc$sex, alc$activities)
mca <- MCA(alc_mca, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = alc_mca, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.253 0.226 0.208 0.195 0.173 0.146
## % of var. 21.067 18.812 17.344 16.229 14.386 12.163
## Cumulative % of var. 21.067 39.878 57.222 73.451 87.837 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.192 0.038 0.040 | 0.707 0.580 0.544 | -0.457
## 2 | -0.192 0.038 0.040 | 0.707 0.580 0.544 | -0.457
## 3 | 0.308 0.098 0.078 | 0.468 0.254 0.180 | -0.013
## 4 | -0.601 0.374 0.320 | -0.087 0.009 0.007 | 0.853
## 5 | -0.366 0.139 0.161 | 0.357 0.147 0.153 | -0.071
## 6 | 0.103 0.011 0.013 | -0.679 0.535 0.563 | 0.171
## 7 | 0.138 0.020 0.022 | -0.065 0.005 0.005 | -0.084
## 8 | -0.192 0.038 0.040 | 0.707 0.580 0.544 | -0.457
## 9 | -0.207 0.044 0.028 | -0.340 0.134 0.075 | -0.878
## 10 | 0.103 0.011 0.013 | -0.679 0.535 0.563 | 0.171
## ctr cos2
## 1 0.263 0.227 |
## 2 0.263 0.227 |
## 3 0.000 0.000 |
## 4 0.915 0.644 |
## 5 0.006 0.006 |
## 6 0.037 0.036 |
## 7 0.009 0.008 |
## 8 0.263 0.227 |
## 9 0.970 0.501 |
## 10 0.037 0.036 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## High grade | -0.876 10.799 0.166 -7.954 | -0.848 11.348
## Intermediate grade | -0.010 0.003 0.000 -0.170 | -0.196 1.520
## Low grade | 0.428 5.425 0.110 6.462 | 0.637 13.478
## alc.romantic_no | 0.159 1.361 0.054 4.549 | -0.129 1.000
## alc.romantic_yes | -0.342 2.936 0.054 -4.549 | 0.277 2.157
## FALSE | -0.506 14.201 0.602 -15.138 | -0.079 0.385
## TRUE | 1.189 33.386 0.602 15.138 | 0.185 0.906
## F | -0.611 15.285 0.401 -12.362 | 0.483 10.700
## M | 0.657 16.448 0.401 12.362 | -0.519 11.514
## alc.activities_no | 0.047 0.082 0.002 0.865 | 0.767 24.726
## cos2 v.test Dim.3 ctr cos2 v.test
## High grade 0.156 -7.705 | -1.159 22.959 0.291 -10.523 |
## Intermediate grade 0.031 -3.440 | 0.653 18.323 0.345 11.468 |
## Low grade 0.243 9.625 | -0.230 1.895 0.032 -3.466 |
## alc.romantic_no 0.036 -3.684 | -0.483 15.303 0.503 -13.840 |
## alc.romantic_yes 0.036 3.684 | 1.041 33.010 0.503 13.840 |
## FALSE 0.015 -2.357 | -0.040 0.105 0.004 -1.183 |
## TRUE 0.015 2.357 | 0.093 0.247 0.004 1.183 |
## F 0.251 9.774 | 0.015 0.011 0.000 0.303 |
## M 0.251 -9.774 | -0.016 0.012 0.000 -0.303 |
## alc.activities_no 0.530 14.215 | -0.307 4.280 0.085 -5.679 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## alc.G3cat | 0.205 0.297 0.449 |
## alc.romantic | 0.054 0.036 0.503 |
## alc.high_use | 0.602 0.015 0.004 |
## alc.sex | 0.401 0.251 0.000 |
## alc.activities | 0.002 0.530 0.085 |
plot(mca, invisible=c("ind"), habillage="quali")
According to the linear regression model, students who study more and students not in a romantic relationship are more likely to have higher school grades. Interestingly, the optimal study time is 5-10 hours/week (studytime3, the largest coefficient). The adjusted R-squared value is 0.055, so the model explains only 5.5% of the variance in school grades. This is partly due to the large number of variables included in the model. Other variables probably affect school grades as well, but their apparent effect is reduced by multicollinearity.
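The adjusted R-squared reported in the model summary can be reproduced by hand from the multiple R-squared, the number of observations and the number of estimated predictors:

```r
# Adjusted R-squared from the figures in the model summary above
R2 <- 0.077   # multiple R-squared
n  <- 382     # observations (372 residual df + 9 predictors + intercept)
p  <- 9       # estimated predictor terms
adj_R2 <- 1 - (1 - R2) * (n - 1) / (n - p - 1)
round(adj_R2, 5)  # 0.05467, matching the summary
```

The penalty factor (n - 1)/(n - p - 1) is what shrinks R-squared when many predictors are added relative to the sample size.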
In the MCA plot, we can see that the two principal components (or dimensions) explain 21.1% and 18.8% of the total variance. High school grade is located in the lower left quadrant (negative correlation with both principal components), intermediate grade near the origin (little correlation) and low grade in the upper right quadrant (positive correlation with both principal components). Extracurricular activities (“alc.activities_yes”), low alcohol consumption (“FALSE”) and lack of romantic relationships (“alc.romantic_no”) correlate with higher school grades. On the other hand, gender (“M” and “F”) has no clear effect. The exact numeric correlations with the principal components can be seen in the summary of the MCA analysis.
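The percentages of variance quoted above follow directly from the eigenvalues in `summary(mca)`: each eigenvalue is divided by their sum. A quick check using the values printed above:

```r
# Eigenvalues from the "Eigenvalues" block of summary(mca)
eig <- c(0.253, 0.226, 0.208, 0.195, 0.173, 0.146)
pct <- 100 * eig / sum(eig)
round(pct, 1)        # first two dimensions: about 21.1 and 18.8
round(cumsum(pct), 1)  # cumulative % of variance
```

The small discrepancies from the printed table come from the eigenvalues being rounded to three decimals here.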
In this case, validating either method is quite challenging, as there is no additional validation cohort. Initially, studytime was also included in the MCA plot, but it cluttered the figure and was therefore left out. On the other hand, the MCA plot nicely corroborates the results of the linear regression model, emphasizing the negative association of romantic relationships with school grades. Luckily, life is more than school grades :)